String Kernel-Based Techniques for Native Language Identification


Abstract

In recent years, Native Language Identification (NLI) has attracted significant interest in computational linguistics. NLI uses an author's speech or writing in a second language to determine their native language. It has applications in forensic linguistics, language teaching, second language acquisition, authorship attribution, and the identification of spam emails and phishing websites. Conventional pairwise string comparison techniques are computationally expensive and time-consuming. This paper presents fast NLI techniques based on string kernels such as the spectrum, presence bits, and intersection kernels, incorporating different learners: Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost, XGB). Feature sets for the proposed techniques are generated using combinations of features such as n-word grams and noun phrases. Experimental analyses are carried out on 8235 English articles from 10 linguistic backgrounds in a typical NLP benchmark dataset. The experimental results show that the proposed technique, a spectrum kernel with an RF classifier, outperformed existing character n-gram string kernels with SVM, RF, and XGB classifiers. Comparable results were also observed among the kernels. Interestingly, random forest outperformed the SVM classifiers across feature sets. All proposed techniques demonstrated a promising improvement in training time, with the best result attaining a decrease of more than 95 percent. The reduced training time makes the approach well suited to scaling to production.
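The three kernels named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions over n-gram counts and is not the authors' implementation. The whitespace tokeniser, the helper names (`word_ngrams`, `spectrum_kernel`, `presence_bits_kernel`, `intersection_kernel`), and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams of a lower-cased, whitespace-tokenised text.

    The tokenisation is a simplifying assumption, not the paper's pipeline.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def spectrum_kernel(a, b):
    """Sum over shared n-grams of the product of their counts."""
    return sum(count * b[gram] for gram, count in a.items() if gram in b)


def presence_bits_kernel(a, b):
    """Number of distinct n-grams that occur in both texts."""
    return sum(1 for gram in a if gram in b)


def intersection_kernel(a, b):
    """Sum over shared n-grams of the smaller of the two counts."""
    return sum(min(count, b[gram]) for gram, count in a.items() if gram in b)


# Toy inputs (hypothetical): word bigram profiles of two short texts.
s = word_ngrams("the cat sat on the mat the cat", 2)
t = word_ngrams("the cat sat the cat ran the cat", 2)

print(spectrum_kernel(s, t))       # 7: (the, cat) gives 2 * 3, (cat, sat) gives 1 * 1
print(presence_bits_kernel(s, t))  # 2: two distinct bigrams are shared
print(intersection_kernel(s, t))   # 3: min(2, 3) + min(1, 1)
```

Each function returns one similarity value for a pair of texts. Evaluating it over all pairs of documents would give a kernel matrix that a kernel-based learner such as an SVM could take as input. How the paper combines the kernels with RF and XGB is not described in the abstract, so that step is left out here.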


Similar resources

Native Language Identification Based on English Accent

The present work investigates the influence of the mother tongue (L1) of a South Indian speaker on a second language (L2). The second language can be a dominant local language, the national language of India (i.e., Hindi), or a connecting language such as English. In the current study, L2 is a short discourse in English. Cepstral and prosodic features were used as in Language Identification (LID) to distingu...


Finnish Native Language Identification

We outline the first application of Native Language Identification (NLI) to Finnish learner data. NLI is the task of predicting an author’s first language using writings in an acquired language. Using data from a new learner corpus of Finnish — a language typology quite different from others previously investigated, with its morphological richness potentially causing difficulties — we show that...


Norwegian Native Language Identification

We present a study of Native Language Identification (NLI) using data from learners of Norwegian, a language not yet used for this task. NLI is the task of predicting a writer’s first language using only their writings in a learned language. We find that three feature types, function words, part-of-speech n-grams and a hybrid part-of-speech/function word mixture n-gram model are useful here. Ou...


Multilingual native language identification

We present the first study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author’s first language (L1) using only their writings in a second language (L2), with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but the...


Arabic Native Language Identification

In this paper we present the first application of Native Language Identification (NLI) to Arabic learner data. NLI, the task of predicting a writer’s first language from their writing in other languages, has been mostly investigated with English data, but is now expanding to other languages. We use L2 texts from the newly released Arabic Learner Corpus and with a combination of three syntactic f...



Journal

Journal title: Human-Centric Intelligent Systems

Year: 2023

ISSN: 2667-1336

DOI: https://doi.org/10.1007/s44230-023-00029-z